{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_635azvh",
    "id": "B4BA5B0FCB884B6DBA6FFE071318505E",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "# 文本预处理\n",
    "\n",
    "\n",
    "文本是一类序列数据，一篇文章可以看作是字符或单词的序列，本节将介绍文本数据的常见预处理步骤，预处理通常包括四个步骤：\n",
    "\n",
    "1. 读入文本\n",
    "2. 分词\n",
    "3. 建立字典，将每个词映射到一个唯一的索引（index）\n",
    "4. 将文本从词的序列转换为索引的序列，方便输入模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_da72pg7",
    "id": "C3B3705CB33847A895AD10619F23E97B",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 读入文本\n",
    "\n",
    "我们用一部英文小说，即H. G. Well的[Time Machine](http://www.gutenberg.org/ebooks/35)，作为示例，展示文本预处理的具体过程。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "graffitiCellId": "id_ytfpat1",
    "id": "7FA4C53DED4F42279EA3AB3229B88DB7",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# sentences 29\n"
     ]
    }
   ],
   "source": [
    "import collections\n",
    "import re\n",
    "\n",
    "def read_time_machine():\n",
    "    with open('D:\\Anaconda3\\Anaconda3_jupyter_documents\\动手学深度学习PyTorch\\Task2-RNN/timemachine.txt', 'r') as f:\n",
    "        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]\n",
    "    return lines\n",
    "\n",
    "\n",
    "lines = read_time_machine()\n",
    "print('# sentences %d' % len(lines))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_gy3tram",
    "id": "EABE813C62FC4E1B8DFFD7B819C31829",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 分词\n",
    "\n",
    "我们对每个句子进行分词，也就是将一个句子划分成若干个词（token），转换为一个词的序列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "graffitiCellId": "id_z5grfxp",
    "id": "F80F8AFC1C0A48BDB66D52A18DC3A940",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['the',\n",
       "  'time',\n",
       "  'traveller',\n",
       "  'for',\n",
       "  'so',\n",
       "  'it',\n",
       "  'will',\n",
       "  'be',\n",
       "  'convenient',\n",
       "  'to',\n",
       "  'speak',\n",
       "  'of',\n",
       "  'him',\n",
       "  'was'],\n",
       " ['expounding',\n",
       "  'a',\n",
       "  'recondite',\n",
       "  'matter',\n",
       "  'to',\n",
       "  'us',\n",
       "  'his',\n",
       "  'grey',\n",
       "  'eyes',\n",
       "  'shone',\n",
       "  'and',\n",
       "  'twinkled',\n",
       "  '']]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def tokenize(sentences, token='word'):\n",
    "    \"\"\"Split sentences into word or char tokens\"\"\"\n",
    "    if token == 'word':\n",
    "        return [sentence.split(' ') for sentence in sentences]\n",
    "    elif token == 'char':\n",
    "        return [list(sentence) for sentence in sentences]\n",
    "    else:\n",
    "        print('ERROR: unkown token type '+token)\n",
    "\n",
    "tokens = tokenize(lines)\n",
    "tokens[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_rap2ka4",
    "id": "01CE759264D84FAA8C60CE9156B86157",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 建立字典\n",
    "\n",
    "为了方便模型处理，我们需要将字符串转换为数字。因此我们需要先构建一个字典（vocabulary），将每个词映射到一个唯一的索引编号。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "9"
    },
    "graffitiCellId": "id_wapwqkb",
    "id": "37532FBF89C242A1805534BBE05C343A",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "class Vocab(object):\n",
    "    def __init__(self, tokens, min_freq=0, use_special_tokens=False):\n",
    "        counter = count_corpus(tokens)  # : \n",
    "        self.token_freqs = list(counter.items())\n",
    "        self.idx_to_token = []\n",
    "        if use_special_tokens:\n",
    "            # padding, begin of sentence, end of sentence, unknown\n",
    "            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)\n",
    "            self.idx_to_token += ['', '', '', '']\n",
    "        else:\n",
    "            self.unk = 0\n",
    "            self.idx_to_token += ['']\n",
    "        self.idx_to_token += [token for token, freq in self.token_freqs\n",
    "                        if freq >= min_freq and token not in self.idx_to_token]\n",
    "        self.token_to_idx = dict()\n",
    "        for idx, token in enumerate(self.idx_to_token):\n",
    "            self.token_to_idx[token] = idx\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.idx_to_token)\n",
    "\n",
    "    def __getitem__(self, tokens):\n",
    "        if not isinstance(tokens, (list, tuple)):\n",
    "            return self.token_to_idx.get(tokens, self.unk)\n",
    "        return [self.__getitem__(token) for token in tokens]\n",
    "\n",
    "    def to_tokens(self, indices):\n",
    "        if not isinstance(indices, (list, tuple)):\n",
    "            return self.idx_to_token[indices]\n",
    "        return [self.idx_to_token[index] for index in indices]\n",
    "\n",
    "def count_corpus(sentences):\n",
    "    tokens = [tk for st in sentences for tk in st]\n",
    "    return collections.Counter(tokens)  # 返回一个字典，记录每个词的出现次数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_k17qg7x",
    "id": "DB6949BC67FF4C7481DFFD00FE64BE56",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "我们看一个例子，这里我们尝试用Time Machine作为语料构建字典"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "attributes": {
     "classes": [],
     "id": "",
     "n": "23"
    },
    "graffitiCellId": "id_hm9y6bm",
    "id": "1BE94FF518DB4C4A8CDFB95C0262B47E",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('', 0), ('the', 1), ('time', 2), ('traveller', 3), ('for', 4), ('so', 5), ('it', 6), ('will', 7), ('be', 8), ('convenient', 9)]\n"
     ]
    }
   ],
   "source": [
    "vocab = Vocab(tokens)\n",
    "print(list(vocab.token_to_idx.items())[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_l6pjfl7",
    "id": "73D0F629056F41B59D4A53F15BFAB552",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 将词转为索引\n",
    "\n",
    "使用字典，我们可以将原文本中的句子从单词序列转换为索引序列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "graffitiCellId": "id_k48bsl2",
    "id": "9FBB71C21B5C4F5283C70CFE3BB07112",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "words: ['this', 'way', 'marking', 'the', 'points', 'with', 'a', 'lean', 'forefinger', 'as', 'we', 'sat', 'and', 'lazily']\n",
      "indices: [72, 73, 74, 1, 75, 76, 16, 77, 78, 79, 80, 56, 24, 81]\n",
      "words: ['admired', 'his', 'earnestness', 'over', 'this', 'new', 'paradox', 'as', 'we', 'thought', 'it', 'and', 'his']\n",
      "indices: [82, 20, 83, 84, 72, 85, 86, 79, 80, 64, 6, 24, 20]\n"
     ]
    }
   ],
   "source": [
    "for i in range(8, 10):\n",
    "    print('words:', tokens[i])\n",
    "    print('indices:', vocab[tokens[i]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_q6fupul",
    "id": "EA20BC8762A74B188D3FDD761BFEDE98",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## 用现有工具进行分词\n",
    "\n",
    "我们前面介绍的分词方式非常简单，它至少有以下几个缺点:\n",
    "\n",
    "1. 标点符号通常可以提供语义信息，但是我们的方法直接将其丢弃了\n",
    "2. 类似“shouldn't\", \"doesn't\"这样的词会被错误地处理\n",
    "3. 类似\"Mr.\", \"Dr.\"这样的词会被错误地处理\n",
    "\n",
    "我们可以通过引入更复杂的规则来解决这些问题，但是事实上，有一些现有的工具可以很好地进行分词，我们在这里简单介绍其中的两个：[spaCy](https://spacy.io/)和[NLTK](https://www.nltk.org/)。\n",
    "\n",
    "下面是一个简单的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "graffitiCellId": "id_7u3knll",
    "id": "46F5F57611E248ECB51F04BD0104E278",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "text = \"Mr. Chen doesn't agree with my suggestion.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "graffitiCellId": "id_ae3i5g2",
    "id": "7D5831E3D5AD4FF48155334F73065451",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "spaCy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "graffitiCellId": "id_uz6civu",
    "id": "30D69C6B1BE44362BA556E2E5EEF493A",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'spacy'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-15-bd289229cdd7>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mimport\u001b[0m \u001b[0mspacy\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      2\u001b[0m \u001b[0mnlp\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mspacy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mload\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'en_core_web_sm'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      3\u001b[0m \u001b[0mdoc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnlp\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      4\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mtoken\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtext\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mtoken\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mdoc\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'spacy'"
     ]
    }
   ],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')\n",
    "doc = nlp(text)\n",
    "print([token.text for token in doc])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "765CC9B5A1C348A58A2B340ADC532FD1",
    "jupyter": {},
    "mdEditEnable": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "NLTK:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "graffitiCellId": "id_r13iwga",
    "id": "B83D30D3670B44A38527B4943BE4DBE0",
    "jupyter": {},
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "ename": "LookupError",
     "evalue": "\n**********************************************************************\n  Resource \u001b[93mpunkt\u001b[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \u001b[31m>>> import nltk\n  >>> nltk.download('punkt')\n  \u001b[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \u001b[93mtokenizers/punkt/english.pickle\u001b[0m\n\n  Searched in:\n    - 'C:\\\\Users\\\\16100/nltk_data'\n    - 'd:\\\\Anaconda3\\\\nltk_data'\n    - 'd:\\\\Anaconda3\\\\share\\\\nltk_data'\n    - 'd:\\\\Anaconda3\\\\lib\\\\nltk_data'\n    - 'C:\\\\Users\\\\16100\\\\AppData\\\\Roaming\\\\nltk_data'\n    - 'C:\\\\nltk_data'\n    - 'D:\\\\nltk_data'\n    - 'E:\\\\nltk_data'\n    - '/home/kesci/input/nltk_data3784/nltk_data'\n    - ''\n**********************************************************************\n",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mLookupError\u001b[0m                               Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-16-af6af6f731b8>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mnltk\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      3\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'/home/kesci/input/nltk_data3784/nltk_data'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 4\u001b[1;33m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mword_tokenize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32md:\\Anaconda3\\lib\\site-packages\\nltk\\tokenize\\__init__.py\u001b[0m in \u001b[0;36mword_tokenize\u001b[1;34m(text, language, preserve_line)\u001b[0m\n\u001b[0;32m    142\u001b[0m     \u001b[1;33m:\u001b[0m\u001b[0mtype\u001b[0m \u001b[0mpreserve_line\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mbool\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    143\u001b[0m     \"\"\"\n\u001b[1;32m--> 144\u001b[1;33m     \u001b[0msentences\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mpreserve_line\u001b[0m \u001b[1;32melse\u001b[0m \u001b[0msent_tokenize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlanguage\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    145\u001b[0m     return [\n\u001b[0;32m    146\u001b[0m         \u001b[0mtoken\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0msent\u001b[0m \u001b[1;32min\u001b[0m \u001b[0msentences\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mtoken\u001b[0m \u001b[1;32min\u001b[0m \u001b[0m_treebank_word_tokenizer\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtokenize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msent\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32md:\\Anaconda3\\lib\\site-packages\\nltk\\tokenize\\__init__.py\u001b[0m in \u001b[0;36msent_tokenize\u001b[1;34m(text, language)\u001b[0m\n\u001b[0;32m    103\u001b[0m     \u001b[1;33m:\u001b[0m\u001b[0mparam\u001b[0m \u001b[0mlanguage\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0mname\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mPunkt\u001b[0m \u001b[0mcorpus\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    104\u001b[0m     \"\"\"\n\u001b[1;32m--> 105\u001b[1;33m     \u001b[0mtokenizer\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mload\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'tokenizers/punkt/{0}.pickle'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlanguage\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    106\u001b[0m     \u001b[1;32mreturn\u001b[0m \u001b[0mtokenizer\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtokenize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    107\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32md:\\Anaconda3\\lib\\site-packages\\nltk\\data.py\u001b[0m in \u001b[0;36mload\u001b[1;34m(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)\u001b[0m\n\u001b[0;32m    866\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    867\u001b[0m     \u001b[1;31m# Load the resource.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 868\u001b[1;33m     \u001b[0mopened_resource\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_open\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresource_url\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    869\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    870\u001b[0m     \u001b[1;32mif\u001b[0m \u001b[0mformat\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m'raw'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32md:\\Anaconda3\\lib\\site-packages\\nltk\\data.py\u001b[0m in \u001b[0;36m_open\u001b[1;34m(resource_url)\u001b[0m\n\u001b[0;32m    991\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    992\u001b[0m     \u001b[1;32mif\u001b[0m \u001b[0mprotocol\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m \u001b[1;32mor\u001b[0m \u001b[0mprotocol\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlower\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m'nltk'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 993\u001b[1;33m         \u001b[1;32mreturn\u001b[0m \u001b[0mfind\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpath_\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mpath\u001b[0m \u001b[1;33m+\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;34m''\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    994\u001b[0m     \u001b[1;32melif\u001b[0m \u001b[0mprotocol\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlower\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;34m'file'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    995\u001b[0m         \u001b[1;31m# urllib might not use mode='rb', so handle this one ourselves:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32md:\\Anaconda3\\lib\\site-packages\\nltk\\data.py\u001b[0m in \u001b[0;36mfind\u001b[1;34m(resource_name, paths)\u001b[0m\n\u001b[0;32m    699\u001b[0m     \u001b[0msep\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'*'\u001b[0m \u001b[1;33m*\u001b[0m \u001b[1;36m70\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    700\u001b[0m     \u001b[0mresource_not_found\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'\\n%s\\n%s\\n%s\\n'\u001b[0m \u001b[1;33m%\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0msep\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msep\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 701\u001b[1;33m     \u001b[1;32mraise\u001b[0m \u001b[0mLookupError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresource_not_found\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    702\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    703\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mLookupError\u001b[0m: \n**********************************************************************\n  Resource \u001b[93mpunkt\u001b[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \u001b[31m>>> import nltk\n  >>> nltk.download('punkt')\n  \u001b[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \u001b[93mtokenizers/punkt/english.pickle\u001b[0m\n\n  Searched in:\n    - 'C:\\\\Users\\\\16100/nltk_data'\n    - 'd:\\\\Anaconda3\\\\nltk_data'\n    - 'd:\\\\Anaconda3\\\\share\\\\nltk_data'\n    - 'd:\\\\Anaconda3\\\\lib\\\\nltk_data'\n    - 'C:\\\\Users\\\\16100\\\\AppData\\\\Roaming\\\\nltk_data'\n    - 'C:\\\\nltk_data'\n    - 'D:\\\\nltk_data'\n    - 'E:\\\\nltk_data'\n    - '/home/kesci/input/nltk_data3784/nltk_data'\n    - ''\n**********************************************************************\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "from nltk import data\n",
    "data.path.append('/home/kesci/input/nltk_data3784/nltk_data')\n",
    "print(word_tokenize(text))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
